ABSTRACT 


Speech is the most natural and easiest way for people to 
communicate, and interpreting speech is one of the most 
sophisticated tasks that the human brain performs. The goal of 
Speech Emotion Recognition (SER) is to identify human emotion 
from speech, which is possible because the tone and pitch of the voice 
frequently reflect underlying emotions. Librosa was used to analyse 
audio and music, SoundFile was used to read and write sampled 
sound file formats, and sklearn was used to create the model. The 
current study looked at the effectiveness of Convolutional Neural 
Networks (CNN) in recognising spoken emotions. The networks' 
input features are spectrograms of voice samples. Mel-Frequency 
Cepstral Coefficients (MFCC) are used to extract 
features from audio. Our own voice dataset is used to train 
and test our models. The emotions of the speech (happy, sad, 
angry, neutral, surprised, disgusted) will be determined based on the 
I. INTRODUCTION 

Speech emotion recognition (SER) is a technique that 
extracts emotional features from speech by analysing 
distinctive characteristics and the acquired emotional 
change. At present, speech emotion recognition is 
an emerging interdisciplinary field of artificial intelligence [1]. A 
speech emotion processing and recognition system is 
made up of three parts: speech signal acquisition, 
feature extraction, and emotion recognition. In such a 
system, the quality of feature extraction has a direct impact on 
the accuracy of speech emotion recognition. In 
feature extraction, the entire emotional sentence is 
frequently used as the unit of 
extraction. The neural networks of the 
human brain are highly capable of learning high-level 
abstract notions from low-level information acquired 
by the sensory periphery. Humans communicate 
through voice, and interpreting speech is one of the 
most sophisticated operations that the human brain 
conducts. It has been argued that children who are not 
able to understand the emotional states of 
speakers develop poor social skills and in some 
cases show psychopathological symptoms [2, 3]. 
This highlights the importance of recognizing the 
emotional states of speech in effective 
communication. Detection of emotion from facial 
expressions and biological measurements such as 
heartbeats or skin resistance formed the preliminary 
framework of research in emotion recognition [4]. 


More recently, emotion recognition from speech 
signal has received growing attention. The traditional 
approach toward this problem was based on the fact 
that there are relationships between acoustic features 
and emotion. In other words, the emotion is encoded 
by acoustic and prosodic correlates of speech signals 
such as speaking rate, intonation, energy, formant 
frequencies, fundamental frequency (pitch), intensity 
(loudness), duration (length), and spectral 
characteristic (timbre) [5, 6]. There are a variety of 
machine learning algorithms that have been examined 
to classify emotions based on their acoustic correlates 
in speech utterances. In the current study, we 
investigated the capability of convolutional neural 
networks in classifying speech emotions using our 
own dataset. The specific contribution of this study is using 
wide-band spectrograms instead of narrow-band 
spectrograms as well as assessing the effect of data 
augmentation on the accuracy of models. Our results 
revealed that wide-band spectrograms and data 
augmentation equipped CNNs to achieve 
state-of-the-art accuracy and surpass human performance. 


[Block diagram: Feature extraction -> Emotion recognition; The model of emotion recognition] 


Fig.1. Speech emotion recognition block diagram 


II. RELATED WORK 

Most of the papers published in the last decade use spectral and prosodic features extracted from raw audio signals. 
The process of emotion recognition from speech involves extracting features from a selected or purpose-built corpus of 
emotional speech, after which emotions are classified on the basis of 
the extracted features. The performance of emotion classification strongly depends on good 
feature extraction (such as the combination of the MFCC acoustic feature with the energy prosodic feature) 
[7]. Yixiong Pan in [8] used SVM for three-class emotion classification on the Berlin Database of Emotional Speech 
[9] and achieved 95.1% accuracy. 


Noroozi et al. proposed a versatile emotion recognition system based on the analysis of visual and auditory 
signals. They used 88 features (Mel frequency cepstral coefficients (MFCC), filter bank energies (FBEs)) and applied 
Principal Component Analysis (PCA) in feature extraction to reduce the dimension of the features previously 
extracted [10]. S. Lalitha in [11] used pitch and prosody features and an SVM classifier, reporting 81.1% accuracy 
on 7 classes of the whole Berlin Database of Emotional Speech. Zamil et al. also used spectral characteristics, 
namely the 13 MFCCs obtained from the audio data, in their proposed system to classify 7 emotions with the 
Logistic Model Tree (LMT) algorithm, with an accuracy rate of 70% [12]. Yu Zhou in [13] combined prosodic and 
spectral features, used a Gaussian mixture model supervector based SVM, and reported 88.35% accuracy on 5 
classes of the Chinese-LDC corpus. 


H. M. Fayek in [14] explored various DNN architectures and reported accuracy around 60% on two different 
databases, eNTERFACE [15] and SAVEE [16], with 6 and 7 classes respectively. Fei Wang used a combination of 
Deep Auto Encoder, various features and SVM in [17] and reported 83.5% accuracy on 6 classes of the Chinese 
emotion corpus CASIA. In contrast to the traditional approaches, more recent papers 
employ Deep Neural Networks in their experiments with promising results. Many authors 
agree that the most important audio characteristics for recognizing emotions are spectral energy distribution, Teager 
Energy Operator (TEO) [18], MFCC, Zero Crossing Rate (ZCR), and the energy parameters of the filter bank 
energies (FBEs) [19]. 


III. TRADITIONAL SYSTEM 

The traditional system was based on the analysis and comparison of all kinds of emotional characteristic 
parameters, selecting characteristics with high emotional resolution for feature extraction. In general, 
traditional emotional feature extraction concentrates on analysing the emotional features of speech 
in terms of time structure, amplitude structure, fundamental frequency structure, and signal features [28]. 
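
As a rough illustration of the kind of hand-crafted parameters a traditional system analyses, the sketch below computes three common frame-level descriptors: short-time energy (amplitude), zero crossing rate (time structure), and an autocorrelation-based estimate of fundamental frequency. The function names, the 50 to 500 Hz search range, and the synthetic test tone are assumptions made for this example and are not taken from the cited work.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame (an amplitude-related feature)."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate fundamental frequency from the autocorrelation peak.

    The search range f_min..f_max is an assumed range for voiced speech.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic 200 Hz tone standing in for one voiced speech frame (illustration only)
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
features = {
    "energy": short_time_energy(frame),
    "zcr": zero_crossing_rate(frame),
    "f0_hz": estimate_f0(frame, sample_rate),
}
```

On real speech these values would be computed per frame and then summarised (for example by mean and variance) over the whole utterance before classification.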
IV. PROPOSED METHOD 
A Convolutional Neural Network (CNN) is used to classify the emotions (happy, sad, angry, neutral, surprised, 
disgusted) and to predict the output, reporting its accuracy. 


The given speech is plotted as a spectrogram using the Matplotlib library, and this is used as input for the CNN to build 
the model. 
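
The window length used to compute the spectrogram decides whether it is wide-band or narrow-band, and the paper does not state the value it used. The sketch below only works through the arithmetic of that trade-off. The 16 kHz sampling rate and the 5 ms and 30 ms window lengths are assumed, typical values, and the frequency resolution is approximated as the DFT bin spacing (sampling rate divided by window length in samples).

```python
def spectrogram_resolution(window_ms, sample_rate):
    """Return (window length in samples, approximate frequency resolution in Hz)."""
    window_samples = int(round(window_ms * sample_rate / 1000))
    freq_resolution_hz = sample_rate / window_samples
    return window_samples, freq_resolution_hz

SAMPLE_RATE = 16000    # assumed sampling rate, not stated in the paper
WIDE_BAND_MS = 5       # assumed short window: fine time detail, coarse frequency detail
NARROW_BAND_MS = 30    # assumed long window: coarse time detail, fine frequency detail

wide = spectrogram_resolution(WIDE_BAND_MS, SAMPLE_RATE)      # (80, 200.0)
narrow = spectrogram_resolution(NARROW_BAND_MS, SAMPLE_RATE)  # (480, about 33.3)
```

Under these assumptions a wide-band frame spans 5 ms with bins about 200 Hz apart, while a narrow-band frame spans 30 ms with bins about 33 Hz apart.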
[Flow diagram: Collecting dataset -> Feature Selection -> Feature Engineering -> Feature Subset -> Classification Model -> Evaluation of results] 


Fig.2. Flow diagram of proposed system 


A. Data Set Collection 

The first step is to create an empty dataset that will hold the training data for the model. After creating the empty 
dataset, the audio data have to be recorded and labeled in different classes. Once the labeling is done, the 
data have to be preprocessed, which removes unwanted 
background noise and leaves a clearer pitch. After preprocessing, the data are split into a train dataset and a test dataset, where the train 
dataset holds 75% of the data and the test dataset holds 25% of the data. 
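
The 75% / 25% split described above can be sketched as follows. This is a dependency-free illustration. The file names, the number of clips, and the fixed random seed are hypothetical, and in practice `train_test_split` from sklearn with `test_size=0.25` would serve the same purpose.

```python
import random

def train_test_split_75_25(samples, seed=0):
    """Shuffle (path, label) pairs reproducibly and split them 75% / 25%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names and labels, used only for illustration
EMOTIONS = ["happy", "sad", "angry", "neutral", "surprised", "disgusted"]
dataset = [(f"clip_{i:03d}.wav", EMOTIONS[i % len(EMOTIONS)]) for i in range(120)]

train_set, test_set = train_test_split_75_25(dataset)
```

Shuffling before the cut keeps every emotion class represented in both subsets on average. A stratified split would guarantee it.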


B. Feature Extraction of Speech Emotion 

Human speech consists of many parameters which reflect the emotions contained in it. As 
emotions change, these parameters also change. Hence it is necessary to select a proper feature vector to identify the 
emotions. Features are categorized as excitation source features, spectral features, and prosodic features. 
Excitation source features are obtained by suppressing the characteristics of the vocal tract (VT). Spectral features used 
for emotion recognition are linear prediction coefficients (LPC), perceptual linear prediction coefficients 
(PLPCs), Mel-frequency cepstral coefficients (MFCC), linear prediction cepstrum coefficients (LPCC), and 
perceptual linear prediction (PLP). Different emotions can be differentiated accurately by using 
MFCC and LFPC [20, 21]. 


C. Mel-Frequency Cepstral Coefficients 
The Mel-Frequency Cepstral Coefficients (MFCC) feature extraction method is a leading approach for speech 
feature extraction. The various steps involved in MFCC feature extraction are: 


[Flow diagram blocks: Pre-emphasis, Mel filterbank, Feature transform, dynamic features] 


Fig.3. Flow of MFCC 


A/D conversion: 
This converts the analog signal into discrete space. 
Pre-emphasis: 
This boosts the amount of energy in the high frequencies. 
Windowing: 
Windowing involves the slicing of the audio waveform into sliding frames. 
Discrete Fourier Transform: 
DFT is used to extract information in the frequency domain [22, 23]. 
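
The steps listed above can be sketched in plain Python as follows. This is a minimal illustration rather than the pipeline used in the study (which relied on Librosa). The pre-emphasis coefficient 0.97, the Hamming window, the frame and hop lengths, and the synthetic 1 kHz tone are all assumed values chosen for the example. A full MFCC computation would continue with a Mel filterbank, a logarithm, and a discrete cosine transform.

```python
import cmath
import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts high-frequency energy."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop_len):
    """Slice the waveform into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames

def power_spectrum(frame):
    """Naive O(N^2) DFT; returns |X[k]|^2 / N for k = 0 .. N/2."""
    n_samples = len(frame)
    spectrum = []
    for k in range(n_samples // 2 + 1):
        x_k = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        spectrum.append(abs(x_k) ** 2 / n_samples)
    return spectrum

# A synthetic 1 kHz tone sampled at 8 kHz stands in for digitised speech (A/D output)
sample_rate = 8000
frame_len = 200
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(400)]

frames = frame_signal(pre_emphasis(signal), frame_len=frame_len, hop_len=80)
spectra = [power_spectrum(f) for f in frames]

# Frequency of the strongest bin in the first frame
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * sample_rate / frame_len
```

Stacking the per-frame spectra side by side gives the spectrogram. The naive DFT is used here only for readability and would be replaced by an FFT in practice.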


D. Classifiers 

After extracting features of speech, it is essential to select a proper classifier. Classifiers are used to classify 
emotions. In the current study, we use a Convolutional Neural Network (CNN). The term convolutional comes 
from the fact that convolution, a mathematical operation, is employed in these networks. Convolutional Neural 
Networks are among the most popular deep learning models and have shown remarkable success in many 
research areas. A CNN is a deep learning algorithm that takes an image as input, assigns importance to various 
aspects of the image, and is able to differentiate one from another. Generally, CNNs have three building blocks: the 
convolutional layer, the pooling layer, and the fully connected layer. In the following, we describe these building 
blocks along with some basic concepts such as the softmax unit, the rectified linear unit, and dropout. 


> Input layer: This layer holds the raw input image. 


> Convolution Layer: This layer computes the output volume by computing the dot product between all filters 
and the image patch. 


> Activation Function Layer: This layer applies an element-wise activation function to the output of the 
convolution layer. 


> Pool Layer: This layer is periodically inserted in the CNN and its main function is to reduce the size of the volume, 
which makes computation fast and reduces memory. The two types are max pooling and average pooling. 


> Fully-Connected Layer: This layer takes input from the previous layer, computes the class scores, and 
outputs a 1-D array of size equal to the number of classes [24, 25]. 
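
To make the layer descriptions above concrete, the following sketch runs a single forward pass through one layer of each kind on a toy input. It is a plain-Python illustration under stated assumptions, not the network trained in this study: the 5x5 input, the 2x2 kernel, and the weights are hand-picked and untrained, only one filter is used, and dropout is omitted.

```python
import math

def conv2d(image, kernel):
    """Valid 2-D convolution, computed as cross-correlation as most CNN libraries do."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - k_h + 1, len(image[0]) - k_w + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)] for i in range(out_h)]

def relu(feature_map):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; shrinks each spatial dimension by `size`."""
    out_h, out_w = len(feature_map) // size, len(feature_map[0]) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)] for i in range(out_h)]

def dense_softmax(flat, weights, biases):
    """Fully connected layer followed by a numerically stable softmax."""
    scores = [sum(w * x for w, x in zip(row, flat)) + b
              for row, b in zip(weights, biases)]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 5x5 "spectrogram" and hand-picked (untrained) parameters, for illustration only
image = [[float((i * 5 + j) % 7) for j in range(5)] for i in range(5)]
kernel = [[1.0, 0.0], [0.0, -1.0]]

pooled = max_pool(relu(conv2d(image, kernel)))   # 5x5 -> 4x4 -> 2x2
flat = [v for row in pooled for v in row]

NUM_CLASSES = 6                                  # six emotion classes
weights = [[0.1 * (c + 1)] * len(flat) for c in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES
probabilities = dense_softmax(flat, weights, biases)
```

The output is a 1-D array with one probability per emotion class, and the predicted emotion is the index of the largest entry.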


[Figure blocks: Convolution, Feature Extraction, Fully connected, Classification] 


Fig.4. CNN Algorithm 


V. APPLICATION 

The applications of speech emotion recognition 
systems include psychiatric diagnosis, conversation with 
robots, intelligent toys, mobile-based emotion 
recognition, emotion recognition in call centres, where 
the emotions of customers can be identified to help 
improve service quality, intelligent tutoring 
systems, lie detection, and games [26, 27]. It is also used in 
healthcare, psychology, cognitive science, 
marketing, and voice-based virtual assistants. 


VI. CONCLUSION 

In this research, we suggested a technique for 
extracting the emotional characteristic parameters 
from an emotional speech signal using the CNN 
algorithm, one of the Deep Learning methods. 
Previous research relied heavily on narrow-band 
spectrograms, which offer better frequency resolution 
than wide-band spectrograms and can resolve 
individual harmonics. Wide-band spectrograms, on 
the other hand, offer better temporal resolution than 
narrow-band spectrograms and reveal distinct glottal 
pulses that are connected with fundamental frequency and 
pitch. On training data, CNNs perform admirably. 
The current study's findings demonstrated CNNs' 
ability to learn the fundamental emotional properties 
of speech signals from their low-level representation 
using wide-band spectrograms. 


VII. FUTURE SCOPE 

For future work, we suggest using audio-visual 
databases or audio-visual-linguistic databases to train 
Deep Learning models in which facial expressions and 
semantic information are taken into account as well as 
speech signals, which should allow improving the 
recognition rate of each emotion. In the future, we can 
also consider using other types of features, applying 
our system to other, larger databases, and using 
other methods for feature extraction. 